GUI agents (computer use)
AI agents
Multi-agent systems
Vision Language Model
Multimodal
OCR
Visual document understanding
Automate the Boring Stuff with Python (homage)
Operating a PC with an LLM!? Trying out Claude's new "computer use" feature right away
https://qiita.com/hedgehog051/items/eddcf26ad5d1dc8086d5
OmniParser
https://huggingface.co/microsoft/OmniParser
OmniParser for pure vision-based GUI agent
https://www.microsoft.com/en-us/research/articles/omniparser-for-pure-vision-based-gui-agent/
Large Language Model-Brained GUI Agents: A Survey
https://arxiv.org/abs/2411.18279
Agent S: An Open Agentic Framework that Uses Computers Like a Human
https://arxiv.org/abs/2410.08164
https://github.com/simular-ai/Agent-S
OmniParser for Pure Vision Based GUI Agent
https://arxiv.org/abs/2408.00203
OS-Atlas: A Foundation Action Model For Generalist GUI Agents
https://github.com/OS-Copilot/OS-Atlas
https://arxiv.org/abs/2410.23218
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
https://arxiv.org/abs/2404.05719
A rough introduction to web agents that operate browsers with LLMs, and related technologies
https://tech.algomatic.jp/entry/survey/agent/web-navigation
BrowserGym
https://github.com/ServiceNow/BrowserGym
BrowserGym leaderboard
https://huggingface.co/spaces/ServiceNow/browsergym-leaderboard
Isn't this practically AGI already? Browser Use has arrived: an AI that operates the browser on its own and does all sorts of things for you
https://note.com/shi3zblog/n/n960fc72b36e9
PC Agent: While You Sleep, AI Works -- A Cognitive Journey into Digital World
https://arxiv.org/abs/2412.17589
Auto-generating code that automates the browser with Python and Playwright
https://qiita.com/mainy/items/3a9de19f440991f67f34
Improving browser-use and connecting it with AI agents
https://github.com/browser-use/web-ui
InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection
https://arxiv.org/abs/2501.04575
Overview of Operator
https://note.com/npaka/n/nc05d63fde4bf
Considering how to apply AI agents to business systems while using the Browser Use Web UI
https://dev.classmethod.jp/articles/browser-use-web-ui/
UI-TARS: Pioneering Automated GUI Interaction with Native Agents
https://arxiv.org/abs/2501.12326
E2B Desktop Sandbox: a safe virtual environment for GUI-operating agents
https://www.ai-shift.co.jp/techblog/5515
https://github.com/e2b-dev/desktop/
UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning
https://arxiv.org/abs/2503.21620
Computer Use: a comparison of OpenAI and Anthropic, and the future outlook
https://studyco.connpass.com/event/350551/presentation/
Trying out "KARAKURI VL", a Japanese VLM for computer-using agents
https://zenn.dev/kun432/scraps/896f0ef6490adf
GTA1: GUI Test-time Scaling Agent
https://arxiv.org/pdf/2507.05791